Characterization of the complete chloroplast genome of Brassica oleracea var. italica and phylogenetic relationships in Brassicaceae

Broccoli (Brassica oleracea var. italica) is an important B. oleracea cultivar, with high economic and agronomic value. However, comparative genome analyses are still needed to clarify variation among cultivars and phylogenetic relationships within the family Brassicaceae. Herein, the complete chloroplast (cp) genome of broccoli was generated by Illumina sequencing platform to provide basic information for genetic studies and to establish phylogenetic relationships within Brassicaceae. The whole genome was 153,364 bp, including two inverted repeat (IR) regions of 26,197 bp each, separated by a small single copy (SSC) region of 17,834 bp and a large single copy (LSC) region of 83,136 bp. The total GC content of the entire chloroplast genome accounts for 36%, while the GC content in each region of SSC,LSC, and IR accounts for 29.1%, 34.15% and 42.35%, respectively. The genome harbored 133 genes, including 88 protein-coding genes, 37 tRNAs, and 8 rRNAs, with 17 duplicates in IRs. The most abundant amino acid was leucine and the least abundant was cysteine. Codon usage analyses revealed a bias for A/T-ending codons. A total of 35 repeat sequences and 92 simple sequence repeats were detected, and the SC-IR boundary regions were variable between the seven cp genomes. A phylogenetic analysis suggested that broccoli is closely related to Brassica oleracea var. italica MH388764.1, Brassica oleracea var. italica MH388765.1, and Brassica oleracea NC_0441167.1. Our results are expected to be useful for further species identification, population genetics analyses, and biological research on broccoli.


Introduction
Broccoli is a vegetable with a high nutrient content in Brassica oleracea. It possesses of wide range of nutrients, including vitamins A and K, antioxidants, β-carotene, calcium, riboflavin, and iron [1], as well as phytochemicals, such as phenols, flavonoids, glucosinolates, minerals, and selenium. The consumption of broccoli is beneficial to human health [2], exerting anti- inflammatory, anti-obesity, cholesterol-lowering, and anti-carcinogenic effects as well as high antioxidant activity [3,4]. Broccoli was introduced to China as a special vegetable and was initially cultivated on a small scale. Over the past few decades, broccoli has played in increasing role in the booming vegetable industry and has become an increasing source of income for farmers. Chloroplasts (cp) are crucial organelles in plant cells as a metabolic center of cellular reactions [5]. They play critical roles in carbohydrate metabolism, photosynthesis, and various molecular processes as well as in the regulation of physiology, growth, development, and stress responses [6,7]. The typical cp genome of angiosperms has a quadripartite structure with a small single-copy (SSC) region and large single-copy region (LSC) region divided by two inverted repeat (IR) regions [8]. The gene content and organization of cp genomes are highly conserved; however, IR expansions and contractions, gene loss, inversions, and rearrangements have been reported [9]. Owing to their high conservation and slow rates of evolution, cp genomes are invaluable for phylogenetic classification, DNA barcoding, and genetic engineering [10,11].
Broccoli crops from the Brassica oleracea group likely originated in the Mediterranean basin and are linked to closely related species, e.g., Brassica cretica and Brassica montana [12]. The selection and domestication processes led to the spread and exchange of genetic materials with other Brassica oleracea cultivars. Intense trade and human migration among several continents promoted the spread of the crop worldwide since the 15 th century, resulting in the development of new cultivars and hybrids, mainly in European and Asian countries. Adaptation to different soil and climatic conditions resulted from the cultivation and selection of genotypes with beneficial agronomical and qualitative traits [13].
Advances in high-throughput Illumina genome sequencing technologies have provided an opportunity to obtain and analyze the complete chloroplast genome of broccoli for analyses of its molecular and genomic characteristics. A sufficient knowledge of its genetic diversity is essential for the development of efficient strategies for its exploitation. Several  In this study, we sequenced and assembled the complete cp genome of broccoli cultivar 2001B (B. oleracea var. italica) for analyses of phylogenetic relationships with B. oleracea cultivars and other members of Brassicaceae. In particular, we de novo sequenced and assembled the complete cp genome of broccoli using the Illumina HiSeq2500 platform, followed by gene annotation and structural analyses, the identification of simple sequence repeat (SSR) markers, and reconstruction of evolutionary relationships among species in Brassicaceae. These results will hopefully improve our understanding of the cp genome and provide a theoretical basis for future scientific research on broccoli.

DNA extraction and sequencing
Broccoli was planted in the experimental field of Zhenjiang Institute of Agricultural Sciences in Jurong, Jiangsu Province, China (N31˚58', E119˚9'). Fresh leaves were collected and wrapped with tin foil, frozen with liquid nitrogen, and immediately stored at -80˚C. Total genomic DNA was extracted from approximately 5 g of leaves with Plant DNA Isolation Reagent (Takara, USA) following the manufacturer's protocol. An Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA) and NanoDrop 2000 Microspectrophotometer were used to evaluate the quality and integrity of the extracted DNA. After purification, the the Jiangsu Agriculture Science and Technology Innovation Fund in the form of funds to LXS [CX (18) DNA was employed to build a sequencing library according to the manufacturer's instructions. The Illumina HiSeq2500 platform (San Diego, CA, USA) was utilized to construct paired-end (PE) libraries with insert sizes of 150 bp and sequenced according to standard protocols, including sample quality testing, library construction and quality testing, and library sequencing.

Cp genome assembly and annotation
High-quality clean reads were generated by trimming and filtering the low-quality reads and sequencing adapters using Trimmomatic v. 0.3649. The clean reads were mapped onto the available cp genome reference of B. oleracea var. capitata (NCBI accession: KX681654.1) using Bowtie2(version 2. 2. 5) [14] with default parameters and preset options. All cp-like reads were assembled into contigs using SPAdes3.10.1 [15]. Then, the contigs were aligned again on the Brasscia oleracea var. capitata reference using the BLAST algorithm. The generated contigs and mate-pair reads were used for scaffolding using SSPACE(version 3.0) [16] to form a circular genome.
Sequencing data and gene annotations of B. oleracea var. italica were submitted to NCBI GenBank database (Accession Number: MN649876.1).

Phylogenetic analysis
The phylogenetic trees were constructed by aligning total chloroplast protein-coding sequences from 31 species in Brassicaceae obtained from the GenBank database, using Carica papaya (NC_010323.1) as an outgroup. MAFFTA version 7.017 [25] was used generate sequence alignments. FastTree v. 2.1.10 [26] was employed to construct a phylogenetic tree by the maximum likelihood (ML) method with the GTRGAMMA model and 1000 bootstrap replicates to evaluate node support.

Characteristics of the broccoli cp genome
The newly generated genome (MN649876.1) was a typical quadripartite circular molecule 153,364 bp in length, containing a pair of two IR (IRA and IRB) regions of 26,197 bp each, separated by a SSC region of 17,834 bp and a LSC region of 83,136 bp (Fig 1 and Table 1). The AT and GC contents of overall cp genome were 63.64% and 36.36%, respectively. The cp genome had a biased base composition (31.36% A, 32.28% T, 17.86% G, and 18.5% C) with an overall GC content of 36.36%. The GC contents of the IR, LSC, and SSC regions were 42.35%, 34.15%, and 29.1%, respectively.

Examination of codon usage frequency
According to the coding sequence (CDS), the relative synonymous codon usage frequency (RSCU) and codon usage frequency were estimated ( Table 6, Fig 2). All protein-coding genes in the cp genome were composed of 26,681 codons. Among these codons, the termination codons were UAA, UAG, and UGA. AUG encoding methionine had the highest RSCU value (2.9901). The most abundant amino acid in the protein-coding genes was leucine (2829 codons, approximately 10.6% of the total), compared with only 325 codons (1.22%) for cysteine. The codon-anticodon recognition patterns of the cp genome showed that 30 tRNAs contained codons corresponding to 20 essential amino acids for protein biosynthesis. The AT contents at the first, second, and third codon positions were 55.3%, 62.53%, and 71.21%, respectively. Moreover, of all 66 codons, the RSCU values for 31 codons were >1, and most (13/16, 93.5%) ended with base A or U, whereas 34 codons had RSCU values of <1, and most of these (16/15, 91.2%) ended with base C or G. Trp was encoded by only a UGG codon, indicating no biased usage (RSCU = 1).
A total of 92 SSRs, including 66 mononucleotides (P1), 18 dinucleotides (P2), 3 trinucleotides (P3), and 5 tetranucleotides were explored. Most were distributed in the LSC (58, 63.00%) and SSC regions (22,23.9%), with some in the IR region (12,13.00%) (Tables 8 and 9, Fig 3). One SSRs belonged to the C repeat units and the others belonged to the A and T types (98.49%), while dinucleotides included TA and AT repeats. Trinucleotides were the last prevalent with the lowest number of repeat units (3). Moreover, 37 repeats were found in different genes, and the remaining were found in intergenic regions.  (Fig 4). In the figure, JLB, JLA, JSB, JSA represent for IRb/LSC, IRa/ LSC, IRb/SSC, and IRa/SSC junction, respectively. The IR sizes of the LSC, IR, and SSC regions were similar in the cp genomes of the seven species, and the IR length varied from 26,035 bp in Table 5. Differences in genome size and genome divergence (SNPs and Indels) between the newly generated genome (MN649876.1) and other Brassica oleracea genomes.    B. napus to 26,459 bp in C. bursa−pastoris (accession number: AP009371). The JLB border was within the coding region of rps19 in the above seven species and only 1 base par difference in location across different cp genomes. The two genes ycf1 and ndhF crossed the JSB junction.

Brasscia gongylodes
Most of the ycf1 gene in the seven species was located in the IRB region and 1-4 bp was located in the SSC region. Overlap between ycf1 and ndhF was detected at the JSB boundary in all studied cp genomes, with lengths of 35 bp to 38 bp. The ycf1 gene crossed the JSA region in all cp genomes, and its length reflected changes in the JSA region. The tRNA noncoding gene trnH-GUG in the seven species were all within the LSC region, located 2-30 bp from the JLA boundary. These results suggested that the IR border shifts were relatively minor, involving only a small number of genes, with differences in gene overlap lengths and the distance of trnH-GUG at the junction of JLA boundaries.

Phylogenetic analysis
cpDNA-based phylogenetic analyses have provided insight into evolutionary relationships, population genetics, and classification in different plant taxa [27]. To investigate the

Discussion
B. oleracea var. italica is an important vegetable among B. oleracea cultivars. In general, the gene content and genome organization of land plant chloroplast genomes are more highly conserved than those of mitochondrial and nuclear genomes. However, gene losses and

PLOS ONE
inversions had been reported in Asteraceae, Leguminosae, and Gentianaceae [28][29][30]. In the present study, we compared the complete cp genomes and gene annotations of various B. oleracea cultivars with data available in the GenBank database. The size of the cp genome obtained in this study was similar to those of other B. oleracea varieties. However, the number of annotated genes differed among genomes; this may be explained by incomplete data, gene losses, or interspecific differences. Our results indicated that the DNA GC content was not evenly distributed among genomic regions. The GC content in the IR region was higher than those in other regions, possibly because the GC content (an indicator of species relationships) of the four rRNAs in this region was high [11,31]. The newly sequenced broccoli genome contained 133 genes, with high conservation in composition and arrangement, including self-replication genes, photosynthetic genes, other functional genes, and genes with unknown functions, consistent with previous research [32]. Furthermore, 23 genes contained one intron or two introns, and trnR-UKK had the largest intron. Introns play crucial roles in the regulation of gene expression depending on conditions and on the location [33]. Coding usage is a key factor in cp genome evolution. In the broccoli cp genome, the most and least frequent amino acids were leucine and cysteine, respectively, as observed in other angiosperm genomes, such as Ananas comosus, Decaisnea insignis, Nasturtium officinale, and Magnolia zenii [34][35][36]. In the broccoli cp genome, AT was preferred over GC, especially at the second and third codon positions (62.53% and 71.21%, respectively), consistent with results obtained for many terrestrial species [37].
A repeat analysis revealed 12 forward, 20 palindromic, and 3 reverse repeats in the broccoli cp genome. Most of these repeats were located in intron sequences, intergenic spacers, and the ycf gene, but several occurred in CDS regions and tRNAs. Repeat sequences are involved in sequence variation, genome rearrangements, and many rearrangement endpoints in algal and angiosperm genomes [38,39]. The organization of cp genome sequences is highly conserved and the SSR primer for cp genomes can be inherited across genera and species. Accordingly,

PLOS ONE
SSRs are widely used as molecular markers for genetic linkage map construction, population genetic analyses, polymorphism identification, plant breeding, and taxonomic analyses [40]. A total of 92 SSRs were obtained in this study, and 66 (71.7%) SSRs belonged to the P1 type, among which 65 (70.7%) belonged to A and T repeat units, while TA and AT repeats belonged to the P2 type. These findings agree with previous results [41,42]. The phylogenetic analysis yielded 53 notes with bootstrap values, among which 21 and 36 notes had bootstrap values greater than 100% and 90%, respectively. In this present study, the phylogenetic trees demonstrated that Brassica nigra and S. arvensis were clustered into one subgroup, which was consistent with others research [43]. And the newly generated genome (Accession Number: MN649876.1) was closely related to NC_041167.1.
Plant cp genomes are considered highly conserved; however, the sizes and LSC/IRb/SSC/ IRa boundaries change due to contraction or expansion at the borders of the IR region [44]. Our results indicated that divergence in the IR border between seven species was related to the different positions of four genes, rps19, ycf1, ndhF, and trnH-GUG, in agreement with previous results [45,46]. It is worth noting that the ycf1 gene was found at the JSA boundary from 1022 to 1034 bp in the IRA region in all cp genomes analyzed. Besides, the trnH gene located at the LSC region in all tested cp genomes, but the distance to the JLA boundary varies from 2-30bp. Combining the above results, we indicate that these seven cp genomes were relatively conserved, and the boundary divergence in the JSA and JLA in these species was the main reason for the expansion and contraction of the IR region.